
Conversation

@bzhangGo
Contributor

What does this PR do?

Add support for T5Gemma2 with multi-modal and long-context capabilities.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Rocketknight1
Member

Hey! Is this a pre-release model? I don't see the checkpoints like google/t5gemma-2-270m anywhere

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@bzhangGo
Contributor Author

Hey! Is this a pre-release model? I don't see the checkpoints like google/t5gemma-2-270m anywhere
yeah, coming soon

@Rocketknight1
Member

cc @ArthurZucker in that case!

Contributor

@vasqu left a comment


Just an initial review from my side.

The 3 main issues I see are:

  • We should definitely be able to have an SWA variant of the bidirectional mask directly in the utils (see the sketch after this list).
  • The processing fn for pixel values should not be passed around imo; it should be handled a level above.
  • We should have one level where the encoder and decoder models are wrapped under a language model.
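
To make the first point concrete, here is a minimal sketch of what a bidirectional sliding-window mask could look like. The function name and shape conventions are illustrative assumptions, not the existing transformers masking utils.

```python
import torch


def bidirectional_sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # Hypothetical helper, not a transformers API: position j is visible from
    # position i whenever |i - j| < window, with no causal constraint.
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs()
    return distance < window  # True where attention is allowed


# With window=2, each token attends to itself and one neighbour on each side.
mask = bidirectional_sliding_window_mask(seq_len=6, window=2)
```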

@bzhangGo
Copy link
Contributor Author

@vasqu Thanks for the comments. Re the major issues:

We should definitely be able to have an SWA variant of the bidirectional mask directly in the utils.

I agree with this move, but I think it's better done on the transformers side since it's a big behavior change. wdyt?

The processing fn for pixel values should not be passed around imo; it should be handled a level above.

This relates to how transformers handles generation for encoder-decoder models. There are two unusual behaviors (sketched below):

  1. the cache is created by GenerationMixin, not the model: https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L1924
  2. the encoder outputs are created from the encoder directly, not the model: https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L902
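
Roughly, as a simplified paraphrase of those two behaviors (not the actual transformers source; DynamicCache and get_encoder are the only real APIs used here):

```python
from transformers import DynamicCache


def prepare_for_generation_sketch(model, model_kwargs):
    # 1. the cache is instantiated by the generation code, not by the model
    model_kwargs["past_key_values"] = DynamicCache()

    # 2. encoder_outputs come from calling the bare encoder module, bypassing
    #    any model-level handling of pixel_values / vision preprocessing
    encoder = model.get_encoder()
    encoder_kwargs = {k: v for k, v in model_kwargs.items() if k != "past_key_values"}
    model_kwargs["encoder_outputs"] = encoder(**encoder_kwargs)
    return model_kwargs
```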

This is the main reason we have some weird designs in T5Gemma2, including the dynamic adjustment of the sliding-window size and the special handling of the vision preprocessor. Please let me know if you have any suggestions!

We should have one level where the encoder and decoder models are wrapped under a language model.

I'm not sure it's a good idea to have another wrapper level for encoder-decoders, as it's common to put the encoder and decoder into the model jointly, as in T5/Bart/T5Gemma.
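
For reference, a toy sketch of the flat layout I mean (illustrative class names, not the real T5Gemma2 code): the encoder and decoder sit directly on the model, with no extra language-model wrapper in between.

```python
from torch import nn


class ToyEncoder(nn.Module):
    def forward(self, input_ids, **kwargs):
        return input_ids  # placeholder for a real encoder stack


class ToyDecoder(nn.Module):
    def forward(self, decoder_input_ids, encoder_hidden_states=None, **kwargs):
        return decoder_input_ids  # placeholder for a real decoder stack


class ToyEncoderDecoderModel(nn.Module):
    # T5/Bart-style flat layout: model.encoder and model.decoder directly
    def __init__(self):
        super().__init__()
        self.encoder = ToyEncoder()
        self.decoder = ToyDecoder()

    def get_encoder(self):
        # generate() relies on get_encoder() to compute encoder_outputs itself
        return self.encoder
```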

Contributor

@vasqu left a comment


The key change should be how we handle caches and the encoder with vision preprocessing (get image features, etc.).

Regarding the generation-related issue, for now we should override the respective functions in the generation mixin we inherit from. I agree that ideally we'd have proper logic in our own code, but I want to postpone this and fix it properly in the future; small overrides are fine, and we already encounter these here and there anyway.
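
One possible shape for such an override, as a hedged sketch: the hook name and signature below follow what GenerationMixin exposes in recent transformers versions but may differ across releases, and get_image_features is an assumed helper rather than a guaranteed API.

```python
from transformers import GenerationMixin


class T5Gemma2GenerationOverrideSketch(GenerationMixin):
    # Illustrative stand-in for the real model class (which also inherits from
    # PreTrainedModel); only the small override is shown here.
    def _prepare_encoder_decoder_kwargs_for_generation(
        self, inputs_tensor, model_kwargs, model_input_name, generation_config
    ):
        # Handle pixel values at the model level before the bare encoder is
        # called, then delegate to the default GenerationMixin behavior.
        pixel_values = model_kwargs.pop("pixel_values", None)
        if pixel_values is not None:
            # get_image_features is assumed to exist on the model (hypothetical)
            model_kwargs["image_features"] = self.get_image_features(pixel_values)
        return super()._prepare_encoder_decoder_kwargs_for_generation(
            inputs_tensor, model_kwargs, model_input_name, generation_config
        )
```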

Contributor

@vasqu left a comment


Thank you, this is already looking pretty good. There are no big issues tbh, just a few smaller ones to align with our standards.

Contributor

@vasqu left a comment


Thanks a lot for iterating! I left some last comments, which are mostly about shortening/simplifying things.

If we could remove some test overrides, that would be awesome. Atm there are a lot of them, which isn't ideal but also not the end of the world.

@vasqu
Contributor

vasqu commented Nov 17, 2025

cc @ArthurZucker @Cyrilvallez for core maintainer review

This is an encoder-decoder model with multimodal capabilities. To properly interact with our generation pipeline, the vision backbone is within the encoder.
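
A minimal sketch of that layout, with hypothetical module names rather than the actual implementation: because the vision backbone lives inside the encoder, the bare encoder that generate() obtains via get_encoder() can consume pixel_values itself.

```python
from torch import nn


class MultimodalEncoderSketch(nn.Module):
    # Hypothetical layout: the vision backbone is a submodule of the encoder.
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Identity()  # placeholder vision backbone
        self.text_encoder = nn.Identity()  # placeholder text encoder stack

    def forward(self, inputs_embeds, pixel_values=None, **kwargs):
        if pixel_values is not None:
            image_features = self.vision_tower(pixel_values)
            # the real model would merge image_features into inputs_embeds at
            # the image placeholder positions; omitted in this sketch
        return self.text_encoder(inputs_embeds)
```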

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, t5gemma, t5gemma2
